Squibs: Reliability Measurement without Limits

Authors

  • Dennis Reidsma
  • Jean Carletta
Abstract

In computational linguistics, a reliability measurement of 0.8 on some statistic such as κ is widely thought to guarantee that hand-coded data is fit for purpose, with 0.67 to 0.8 tolerable, and lower values suspect. We demonstrate that the main use of such data, machine learning, can tolerate data with low reliability as long as any disagreement among human coders looks like random noise. When the disagreement introduces patterns, however, the machine learner can pick these up just like it picks up the real patterns in the data, making the performance figures look better than they really are. For the range of reliability measures that the field currently accepts, disagreement can appreciably inflate performance figures, and even a measure of 0.8 does not guarantee that what looks like good performance really is. Although this is a commonsense result, it has implications for how we work. At the very least, computational linguists should look for any patterns in the disagreement among coders and assess what impact they will have.
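The abstract's closing recommendation, to look for patterns in the disagreement among coders, can be illustrated with a small sketch. It assumes two coders and uses Cohen's κ, which is only one of several statistics the abstract's "such as κ" could cover. The function names, the labels `q` and `s`, and the toy annotations are hypothetical and are not taken from the article.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: computed from each coder's own label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both coders used a single identical label; kappa is undefined,
        # so this sketch returns 1.0 by convention (an assumption).
        return 1.0
    return (observed - expected) / (1 - expected)


def disagreement_table(labels_a, labels_b):
    """Count each (coder A label, coder B label) pair where the coders differ."""
    return Counter((a, b) for a, b in zip(labels_a, labels_b) if a != b)


# Hypothetical toy data: every disagreement goes in the same direction.
coder_a = ["q", "q", "q", "q", "s", "s", "s", "s", "s", "s"]
coder_b = ["q", "q", "s", "s", "s", "s", "s", "s", "s", "s"]

if __name__ == "__main__":
    print("kappa:", round(cohen_kappa(coder_a, coder_b), 3))
    print("disagreements:", dict(disagreement_table(coder_a, coder_b)))
```

In this toy data every disagreement is an item that coder A labelled `q` and coder B labelled `s`. A one-directional table like this is the kind of systematic pattern the abstract warns a learner could pick up, whereas disagreements spread evenly across label pairs would look more like random noise.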

Similar resources

Squibs And Discussions - Evaluating Discourse And Dialogue Coding Schemes

Agreement statistics play an important role in the evaluation of coding schemes for discourse and dialogue. Unfortunately there is a lack of understanding regarding appropriate agreement measures and how their results should be interpreted. In this article we describe the role of agreement measures and argue that only chance-corrected measures that assume a common distribution of labels for all...

Interobserver reliability of neck-mobility measurement by means of the flock-of-birds electromagnetic tracking system.

OBJECTIVE To establish the interobserver reliability for measuring neck mobility in human subjects by means of an electromagnetic tracking device, the Flock-of-Birds system. METHODS Two observers independently and in random order assessed the cervical range-of-motion in 30 subjects with a dysfunction in the neck and shoulder region (symptomatic subjects) and 30 subjects without known patholog...

A comparison of the validity and reliability between a digital radiographic imaging system and manual method in measuring the Cobb angle

Methods Twenty adolescents with scoliotic curvatures were chosen to participate in the study based on convenience, without predilection for gender, age, type or location. Images of the curvatures were examined by 15 trained observers to estimate the Cobb angle variability, as well as intra- and inter-observer variations. Each image was measured three times at a minimum interval of one week betwee...

Squibs and Discussion 1 Introduction

The contrast in the semantics of Turkish object NPs with and without overt case morphology has received some attention in the literature (see, e.g., Dede 1986, Knecht 1986, Tura 1986, Enç 1991). The NPs with overt case morphology in (1a) and (1b) yield specific readings, whereas the NPs without case morphology in (1c) and (1d) are nonspecific. In this squib, I will examine nonspecific objects a...

Publication date: 2008